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Summary. Binary trait data record the presence or absence of distinguishing traits in individ- 
uals. We treat the problem of estimating ancestral trees with time depth from binary trait data. 
Simple analysis of such data is problematic. Each homology class of traits has a unique birth 
event on the tree, and the birth event of a trait visible at the leaves is biased towards the leaves. 
We propose a model-based analysis of such data, and present an MCMC algorithm that can 
sample from the resulting posterior distribution. Our model is based on using a birth-death 
process for the evolution of the elements of sets of traits. Our analysis correctly accounts 
for the removal of singleton traits, which are commonly discarded in real data sets. We illus- 
trate Bayesian inference for two binary-trait data sets which arise in historical linguistics. The 
Bayesian approach allows for the incorporation of information from ancestral languages. The 
marginal prior distribution of the root time is uniform. We present a thorough analysis of the 
robustness of our results to model mispecification, through analysis of predictive distributions 
for external data, and fitting data simulated under alternative observation models. The recon- 
structed ages of tree nodes are relatively robust, whilst posterior probabilities for topology are 
not reliable. 

Keywords: Phylogenetics, binary trait, dating methods, Bayesian inference, Markov chain 
Monte Carlo, glottochronology 

1. Introduction 

A great deal of progress has been made on the statistical analysis of DNA sequence data, and 
in particular for model-based estimation of genealogy. No equivalent statistical framework 
exists for trait-based cladistics. However, qualitative and quantitative trait data may be 
used to recover dated tree-like histories in situations where we have no genetic sequence data. 
Progress is possible when the traits are similar in type, so that some unifying assumption 
about their evolution is justified. 

We give statistical methodology for tree-estimation from binary trait data. These data 
are made up of binary sequences, each sequence recording for one taxon the presence or 
absence of a list of traits. Pigeon wings and sparrow wings are instances of the trait "bird 
wings" displayed at the taxa "Pigeon" and "Sparrow". In our model, two instances of 
a trait are necessarily homologous, that is, they descend from a common ancestor. Trait 
observation models have many missing data. Birth times of observed traits are unknown, 
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and traits displayed at less than two taxa may be discarded. We model these missing data, 
integrate them out of the analysis analytically, and measure the random error using sample 
based Bayesian inference. 

The outline of this paper is as follows. In the first half (Sections[2]to[n|) we set up Bayesian 
inference for a class of trait models. We give observation models, likelihood evaluation, prior, 
posterior and a MCMC scheme. In the second half of the paper (Sections [7] to [§]) we apply 
the inference scheme to two closely related data sets. We review previous studies of these 
data and describe the particular models we fit, then present results and model mispecifica- 
tion analysis. Readers interested in the application only should read the first paragraph of 
[2 and all of Section [4] before jumping to the data analysis in Sections [3 and El Graphi cs il- 
lustrating this application make up the bulk of the supplement iNicholls and Gray ( 2007 ). see 
http : //www . stats . ox . ac . uk/~nicholls/linkf iles/papers/NichollsGray06-SUPP .pdf . 
Supplement section labels correspond to the section labels in this paper. 

We begin in Section [3] with the observation process. In Section [3] we give an efficient 
schem e for evaluating the corresp onding likelihood. The model, described in lHuson and Steel 
( 2004 ) and Atkinson et al. (2005), is the natural stochastic process representing Dollo's par- 
simony criterion, since each instance of a given trait descends from a single innovation. In 



Huson and Steell (|2004l ) the traits are distinct genes which are present or absent in an indi- 



vidual, and trees ar e built using a maximu m-likelihood pairwise-distance, and the neighbor- 
joining m ethods of[ Saitou a nd Nei dl987h. Ou r own work has been motivated by a pair of 
data-sets, Dven et al.l ( 1997h and Ringe et al. ( 2002 ). recording trait presence and absence 
for Indo-European languages. Here traits are cognate classes (also called lexical traits), that 
is, homology classes of words of closely similar meaning. Thus English, Flemish and Danish 
share the trait "all/alle/al" whilst Spanish, Catalan and Italian lack that trait, but share 
"todo/tot/tutto". 

In Section[5]we write down two prior probability distributions for trees. The first imposes 
an exponential penalty on branch length. The second is designed to be non-informative with 
respect to the height of the root node, and otherwise uniform on the space of trees. This 
distribution is a new class of tree priors and is likely to be useful in a broader phylogenetic 
setting. General calibration constraints are introduced in Section [4] with specific examples 
in Section 17.21 These constraints, derived from historical records, bound some tree node 
ages above and below and are used to fix trait birth and death rates. They determine 
the parameter space of trees. In Section [5] we gather together the results of Section and 
Section 2] and write down the full posterior density for the model parameters. We give 
cl osed form results for the two leaf case, and verify that we reproduce the distance measure 
of Huson and Steell (|2004h . We give a brief description of our MCMC algorithm for sampling 
the posterior density in Section O 

In Section [7] we introduce the cognate data and the associated calibration constraint 
data and summarize previous work. Section [5] has two parts. Section 18.11 gives further 
details of the model we fit and Section [8.21 summarizes results and conclusions. 

Statistical contributions to the dating of language branching events have been rejected 
by linguists. Dating efforts are criticized for their assumption of a cons tant rate of language 



chang e at all times and in all places, the so called "glottal clock". Bergsland and Vogtl 
( 19621 ) fo und example s of extreme rates, but employed counter-examples biased by data- 
selection. Blust (2000) links rate heterogeneity to long-branch attraction. However, neither 
criticsm conside rs even the ran dom component of the error. In this respect we are repeating 
the comments of ISankofj (|l973l) . In Section M we investigate model errors. Although we find 
evidence for model mispecification we nevertheless reproduce, to within random error, age 
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estimates in analyses across near-independent data, and in reconstructions from synthetic 
data simulated under likely model- violation scenarios. 

2. A model of binary trait evolution 

In this section we specify an observation model for traits evolving on a fixed tree. We preface 
this section with a qualitative description of the model. An individual is represented by a set 
of traits which are distinguishable and non-interacting. As calendar time passes, new traits 
are born into the set at a constant rate. Each such birth generates a new homology class of 
trait instances. At a tree node, two identical copies of the set entering the node leave that 
node. Two instances of each trait entering a node leave the node, one instance in each set. 
Each instance of each trait in each set dies independently at a constant rate. A graphica l 
illustration of the process and notation is given in the supplement . iNicholls and Gravl (l2007h . 
Models of the tree itself are given in Section 0J 

We begin our formal description with notation for the tree. Let g = (E, V, t) be a rooted 
binary tree with L leaves plus one extra node (label R*) ancestral to the root itself. The set 
of all nodes is then V — {1,2,..., 2L}, with leaf nodes Vl, ancestral nodes Va and edges 
G E where i, j G V and i < j. Node ages t = {ti,t 2 , ■ ■ ■ ,*2i) are ordered < t i+1 
and increase from the leaves to the root. The root node label is R = 2L — 1 and there is 
an additional node R* = 2L with age t R * — oo which is connected to the root via an edge 
{R,R*)eE. Our convention is R* Va, so V = {R*} U Va U Vl- Leaves may be staggered 
in time. 

Next we describe the evolution of traits. Sets of trait instances are sets of trait labels 
which evolve along the branches of g from the root towards the leaves, in the direction of 
decreasing age. Identify edge ( i,j ) by the node i at the base of that edge (edges are directed 
for increasing age, from leaf to root, in the opposite direction to the calendar evolution of 
traits, from root to leaf). For G E and r G [ti,tj) denote by (t, i) a time point on a 

branch of g and by 

l9}= U U 

(.j)eEre[t,ij) 

the set of all such points, including points on the edge (R, R* } of infinite length. For each 
branch G E define a set-valued process H(r,i) = {/ii,/i2, . . . , /iat(t.i)} of trait labels 

h a G Z, a = 1, 2, ... , N(t, i). The elements of H(t, i) are realized by a simple reversible 
birth-death process which acts along each edge of the tree. Set elements are born at constant 
rate A. The label for each new born trait is unique to that trait, but otherwise arbitrary. 
Set elements die at constant per capita rate fi. At a branching event (tj, i) G [g], set H(ti, i) 
is copied onto the top of the two branches and (k,i) emerging from i, so that the 

evolution of H(r,j) and H(t, k) is conditional on H(ti,j) = H(ti, k) = H(ti, i). 

The number of elements N(tR, R) in the root set is the number of traits born in 
[£r, oo) which survive to time <r. These surviving traits are generated at rate A(r, R) = 
Aexp(— ju(r — t^)), so their number is Poisson, mean A/ \i. The process may therefore be 
initialized by simulating N(tu,R) ~ II(A//x), and assigning N(tn,R) arbitrary trait labels 
H(t R , R) = {1,2,..., N(t R , R)} to the root set. 

The data = (H u i G V L ), where 



Hi = H(U,i), ieV L 



(1) 
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are an ordered list of the sets of trait labels observed at the tree leaves. Suppose that in 
all there are N distinct trait labels C = (ci,c 2 , . . . , cjv) (so, C — (Ujgy^ili)) displayed at 
the leaves. We can represent the data as N sets of taxa labels also, with set M a giving the 
leaves at which trait c a appears. This representation is D^ N ' = (Mi,M2, . . . , M;v), with 
M a = {i:c a €Hi, i e V L } for a = 1, . . . , N. 

Traits displayed at just one taxon are often dropped from the data. It is argued that 
these singleton traits do not inform tree topology. This is not the case in the model we 
have described, since singleton traits are informa tive of time depth. Referring to the data 
analyzed in Sect ion [71 iGrav and Atkinson (|2003l ) drop singleton traits in their binary reg- 
istration of the [ Dven et al.1 (1997) data-set but retain them in their registration of the 
iRinge et al. ( 2002^ data. Let I c eJ?(t-,j) = 1 if c € H(tj,j) and zero otherwise. The thinned 
data is ={Hi,i^V L ), with 



Hi 



c e H(U,i) 



jev L 



> 1 



i€V L . 



(2) 



We call this observation model, which drops singleton traits, NOUNIQUE, in contrast to 
the model NO ABSENT defined by Equation (J). We write D for generic NO ABSENT or 
N OUNIQUE d ata. 



Felsensteinl (|1992| ) gives the likelihood for a Poisson process acting on a finite state space, 



along t he branches of a tree, conditioned to show states other than the zero state at the 
leaves. iLewid ( 200lh proposes applying certain trait models of this kind (so-called Jukes- 
Canto r models) to morphological character data, in a maximum likelihood analysis. Lewis! 



200lh mentions the problem of thinning traits displayed at a single taxon, and treats it 



by ensuring the data are not so thinned. iNvlander et al.l (2004) fit models from the same 



family, allowing for the thinning of all parsimony uninformative characters (traits displayed 
at 0, 1, L — 1 or L leaves). These models do not constrain a trait to be generated at a single 
birth event. The authors model a fixed number of traits which move back and forward 
between different categorical values indefinitely. The number of distinct traits is fixed for 
all time. We impose a single birth event for a trait and an evolution which proceeds from 
absence to presence to absence only. The number of distinct traits generated by our process 
is random, so that the total number is info rmative of the rela tive rates of birth and death. 
The model we have described resembles the Watterson (|l975l) infinite sites model, but here 
trait-death is in effect ba ck-mutation. Our model is similar to the infinite alleles model of 
Kim ura and Crow though the number of alleles is not random, whilst the number 

of traits is random. 



3. Likelihood calculations 

The likelihood for g,fi and A is given in terms of the distribution of the point process of 
birth points for those traits displayed in the data. Let X = {X%, X2, . ■ ■ , Xn} be a random 
set of trait birth-points in [g]. The Poisson process generating X is obtained by thinning 
realizations of a constant rate process. Suppose a trait with label c is born at z 6 [g]\ 
let O(z) = J2 ieVL ^ceHt give the number of taxa displaying trait c (after any thinning). If 
Pr{0(z) > d\z, <?, fj,} is the probability for a trait, born at z £ [g] to appear in the data at d+l 
or more leaves, then the trait birth-rate at z in process X is A(z) = APr{0(z) > d\z,g, /x}, 
where d = under the NOABSENT observation model and d = 1 under NOUNIQUE. 
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The distribution of X is denned on the space X of all finite subsets x C [g]. For 
/:[<?] — » 5ft, define the integral Jj g j f(z)dz along tree branches by 

/ /(*)dz = I 3 f{{r,i))dr. 

Now, suppose X = x with a; = {x\, x%, ■ ■ ■ , sat} and x a — (r a , i a ) for a = 1, . . . , N, so that 
.t q G [<?] identifies the point on the tree where trait c a was born. Let dx a = dz at x a = z. 
The density of the random set X = x, with respect to dx = dx\dx2 ■ ■ ■ dxN on X, is 



fx(x\g,n,\) 



exp - / \{z)dz TT A(x a ) 
V J \g] J a=l 



The total number of distinct traits in the data N ~ II ( A(z)dzj has a Poisson distribu- 
tion. 

Trait birth points are nuisance parameters, which we integrate out of the likelihood 
under the density fx- Denote by Pr{Af a = m a \x a , g, fi, 0(x a ) > d} the probability for a 
trait, born at x a , to be displayed at the leaves listed in set m a and no others, conditional 
on being displayed in at least d leaves. The likelihood, P(D\g, fi, A), is 



P(D\g,[i,\) = / P(D\x,g,fj,)f x {x\g,fJ,,\)dx 
J x 

- f, , \(z)dz N 



A / Pr{M« = m a \x a ,g,fi,0(xa) > d}Pr{0(x a ) > d\x a ,g,n}dx a . 

a=l J l9] 



Nl 

a=l 

The outcome {M a = m a ,0(x a ) > d} is identical to the outcome {M a = m a } for traits 
in the data, since those traits already satisfy the thinning condition cardm a > d for each 
a = 1, 2, . . . , N. It follows that events in the data satisfy 

or,, i n , N . n Pr{M a = m a \x a ,g,^} 
Pr{M a = m a \x a ,g,n,0(x a ) > d} = -, 

Pr{(J(Xa) > d\X a ,g,Hi 

and consequently the likelihood is 



P{D\g,n, A) = — exp ( - / X(z)dz) TT A / Pr{M a = m a \x a) g, [i}dx a . 

N] V J W / a=l •%] 



(3) 



We compute A /r , Pr{0(z) > d\z, g, fi}dz and the f actors A Jj n j P r{M a = m a \x a , g, fJ*}dx a 
using recursions related to the pruning recursion of Felsensteml (| 1 9 8 lh . We begin with 
A Pr{0(z) > d\z,g,(i}dz. A birth at a generic point (r,i) can be shifted to the child 
node, (ti,i), 

Pr{0(r,i) > d\(T,i),g,fi} = Pr{0(U,i) > d\(U, i), g, fi} exp(-^(r - t l )), 
and the integral over [g] reduced to a sum over contributions from edges: 

xf Pr{0(z)>d\z,g,fi}dz=- V PY{0(U,i)>d\(U,t),g,v}(l-e-^-^). (4) 



6 

We are interested in the cases d = and d = 1. Let uf 1 = Pr{0(ij, i) = ci|(ii, i), 3, /i} so 



Pr{0(ii,i) > d|(ti, 



1 - ?4 0) d = 0, 

i_ u f)_ u w d=1 . 



We give recursions for the u| . Consider a pair of edges (j, i), (fc, i) in U. Let 5jj 
e -n(ti-tj)^ rjijjg recurs i ons 



= ((1 - <M + «ij«} 0) ) ((1 - s itk ) + *i, fc 4 0) ) 



(0) 



(5) 



are evaluated from ul -* = and = 1 at leaves i e Vl- 

We need to compute A H , Pr{M a = m a |a: a , <7, /x}da; for generic trait patterns. Trait 
c a is born into an edge ancestral to all the leaf nodes which display it, so the edges of 
g which contribute to the integral dx a are those edges, E a say, on the path to node R* 
from the most recent common ancestor of the leaf nodes in m a . Also, m a is non-empty, so 
Pr{M a = m a |(r, i), g, /1} = Pr{M a = m a |(ij, i), g, /z} exp(— fi(r — U)). We write the integral 
over [g] in terms of a sum over contributions from edges: 



xf Pr{M a = m a \x a ,g,fi}dx a = - V" Pr{M a = m a \(ti, i), g, 
7 M 



e-^fe-t*)). ( 6 ) 



Let V£ be the set of leaf nodes in V descended from node i, including i if node i is a leaf. 

For leaf sets m a let mi = V£ n m . Consider two edges {k,i) \n E. Events are 

independent down the two branches, 

Pr{M« = m« I (U, i), g, M } = Pr{M« - m« | (t t , j), g, M } Pr{M« = m^lfe, fc),S,M}, 
and moving from the top (U,j) to the bottom (tj,j) of branch 

«*NUUrt - { 7**f ■^••»-'-"> "i;^' (7) 
[ (1 - ^j) + Siju) if m^ = 0. 

The recursion is evaluated from the leaves, 



Pr{M«)= m W|(t i ,j), 5>A i} 



1 if j is a leaf and rria — {j}, 
if j is a leaf and rria = 0. 



The recursion need not reach the leaves. It can be evaluated from nodes j satisfying 

%j is computed for the J, g , 



la = 0, using Equation |J7]) since u' ^ is computed for the J, , A(z)e?z evaluation. 



4. Prior models on trees 

In this section we specify two families of probability distributions over trees, which we use 
to represent prior information concerning the phylogeny. 

One tree prior we use is a branching process Gl with rate 9 stopped at the instant of 
the Lth branching event (counting the branching at the root). Denote by T the space of 
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G^-realizable trees and by dg the measure IlieVA with counting measure on topologies. 
The process Gl determines a density 

/ G (. 9 |0)cx^- 1 exp(-0| g |) 

with respect to dg, where \g\ is the sum of all branch lengths, excluding the branch ( R, R* ). 
The same functional form of the density is used when tree leaves are offset in time. 

In Section [3 a hypothesis of the form Hr G [t m im ^max]" is central. This motivates a 
prior which is non-informative with respect to such hypotheses. One prior which is strongly 
informative for is the prior density f(g\T) ex I 4h <t, the uniform distribution over all 
trees in T with root age smaller than T, a fixed upper limit. We find that, for trees with 
isochronous leaves at t% = ti = . . . = ti, = 0, the marginal distribution of is t^~ 2 (for 
each g € T the topology-constrained volume integral J dt^+i • • ■ <^2L-2 °c tn L ~ 2 )- This 
prior represents a state of belief in which Pr{tn E [T/2,T]} is about 2 L_1 times greater 
than Pr{tji G [0, T/2]}. The marginal density of tn in the prior 

f R (g\T) cx t 2 R L l tR < T 

is uniform in [0,T]. 

In Section [7721 certain groups of taxa, called clades, are known to group together on the 
tree. Upper and lower bounds on the age of their common ancestor are used to calibrate 
rate parameters. Admissible trees g G V , V C r, satisfy these prior calibration constraints. 
Where calibration constraints are imposed, the prior /r must be modified, in order to 
maintain a uniform prior distribution for the root age. Nodes in clades with clade root 
times bounded above by calibration constraints do not contribute a factor tn to the tree- 
topology constrained volume integral J Yiiev A \{R.} c ^ i ~ ^^ ie density fn must be further 
modified to take into account non-isochronous leaf dates. The exact result is beyond us. 
However, if S is a list of free nodes, ie nodes i £ Va \ {R} outside or above root-bounded 
clades, and for node i S S in tree g S V , Si is the minimum time- value node i can achieve 
in any admissible tree, then 

fR.{g\ T ) h R <T Yi^R ~ 
ies 

gives a reasonably flat marginal distribution for large compared to the calibration dates. 
Refer to the supplement for the results of prior simulation. In the examples following 
Section we summarize posterior distributions computed under tree prior f = fa, with 
clade calibration constraints. Results for the prior fa are similar, and are displayed in the 
supplementary material. 

We encounter data in which leaf node times are themselves subject to uncertainty. 
Calibration data on leaf node times allow leaf times to vary in a range, so that for each 
i G Vl, ti G [t^,tf]. The leaf times ti,i G Vl become missing data. In Section [71 the 
allowed range for leaf times is small compared to the time over which traits evolve. We take 
a prior uniform in [t~ , tf] for fj,i G Vl- 

5. Posterior distributions 

Our final expression for the likelihood is obtained by substituting Equation ((4| and Equa- 
tion ([6|) into Equation ((3|) , and evaluating these terms using Equation (O and Equation (UJ 
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respectively. Multiplying that likelihood by the tree-prior fa given in Section 2] and a prior 
density p(p, A, 9) for our rate parameters, we obtain the posterior distribution 



p(g,p,X,9\D)d9dXdpdg <x exp ^ Pr{0(ti,i) > d\(t u i),g, p}(l - e"^-**)) 



N 



^-x e -e\ g \, 



(8) 



x II E Pr{M = m |(t i ,*) j5 ,A*}(l-e-' , ^- t *^ 

AT 

' l p(p,X,9)d9dXdp J[ dt t . 

Equation ((HJ) holds for tree prior /g- Under tree prior fn(g\T) we drop parameter 9 from 
the posterior and replace fa oc L_1 e _e l s l with /r oc i^~ L ]I fn <T. 

Time scale is undetermined under scale invariant priors p(p, A, 0) = (pX9)~ 1 . For p > 0, 
the transformation (t%, . . . , £r, /Lt, A, 9) — > (t%/p,..., Ir/ p, up, Xp, 9p) leaves /i, A, 9\D)dgdXdpd9 
invariant, so it cannot be a proper distribution. The problem remains (for p > tn/T) under 
tree prior Date calibration data described in Section 0] and Section TT% restricts the 
space of tree states from T to V, and thereby breaks the time-rescaling invariance. The 
posterior becomes proper. 

The special case of two taxa (so, = t\ < t 2 < with tn = the tree height) is 
of interest for checking and debugging. The data, = (Hi,H 2 ), are tw o lists of trait 

instances, including traits present at just one leaf. iHuson and Steel compute the 

MLE, \g\*, for the total tree length \g\ — 2tn — t% — ti in a two leaf tree directly, using the 
reversibility of the birth-death process of traits between the two leaves, and conditioned on 
X/ p known. In this way they motivate a new measure of the distance between two binary 
sequences as the pairwise maximum likelihood distance between the two sequences. Let 
n\ = card Hi\H 2 , n 2 — card H 2 \ Hi and n\ 2 — card (Hi n H 2 ), so that TV = m 2 + m + n 2 . 
The data D amounts to rii, n 2l rt 12 in the two leaf case. The likelihood for the two leaf case, 
computed from Equation ©, using Equation ^ and Equation ©, is 

N 



P(ni,n 2 ,ni 2 ||g|, X,p) oc 



exp 



A r 

/./, 



-P-\9\ 



-A*|ffl»i 



Maximizing this expres sion over \g\ given X/ p we recover the branch length calculated in 
Huson and Steell ((2004) . If instead we maximize P(ni,n 2 ,Tii 2 \\g\, X, p) over A and \g\ > 
t 2 — ti we get an estimate |<7|* for the time separation of two taxa, 

1 



log 1 



m + n 2 
2ni 2 



(9) 



Swadeshi (|1952i ) fits a relation of this kind to lexical trait data. 

The posterior distribution for \g\ given p, which is available in closed form for the two leaf 
tree, is useful for debugging MCMC code. Taking priors p(X,9) — (X9)~ 1 in Equation ([5]) 
and integrating out A and 0, we obtain, 

ril+«2 



P 



, 1 

p,ni,n 2 ,ni 2 ) oc —— 

ml 



e -iAa\ 


"12 


"1 


_ e -n\g\ ' 


2 _ e -n\g\ 




2 


— e~Hsl 



(10) 



Here p and \g\ appear in the combination p\g\. When we consider large trees, and estimate 
p, calibration constraints fixing the age of clades in g separate this pair of variables. 
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Table 1 . A miniature lexical "data set" wit h L = 3 language s, K = 3 meanings and 
N = 6 distinct traits, C* = {1, 2, . . . , 6}, from lDven efail i 19971) . 





"to give" 


"big" 


"we" 






ft = 


= 1 


k -- 


= 2 


k = 


= 3 


Flemish 


geven 


groot 


wy 




i = 1 


c = 


= 1 


c = 


= 3 


c = 


= 6 


Danish 


give 


stor 


vi 




i = 2 


c = 


= 1 


c = 


= 4 


c = 


= 6 


Kashmiri 


dyunu 


bodu 


asi 




i = 3 


c = 


= 2 


c = 


= 5 


c = 


= 6 



6. Markov chain Monte Carlo 

We work exclusively with the marginal posterior density p(g, (i\D). When the prior for A is 
A -1 , this variable is Gamma distributed in the posterior, and may be integrated. The same 
observation applies to 0, when we use the fc prior. Sampling the posterior distribution 
p(g, p\D) via Metropolis-Hastings Markov chain Monte Carlo is fairly straightforward, once 
efficient schemes for evaluating and updating the recursions, Equations (JSJ) and have 
been implemented. 



We use the tree operations described in iDrummond et alj (|2002[ ). These include up 



dates which alter the tree topology, updates which vary node times, updates which vary 
parameters such as fi, and updates which make some combination of these changes. In 
a specimen update we generate candidates for Metropolis-Hastings updates by simulating 
p ~ f/ (1/2, 2) and setting t' = pt and p! = p/p, since this is expected to be a ridge direction 
of the loglikelihood. In the acceptance probability for this update, the probability density 
to generate the reverse update, with p' = l/p, is equal to the probability density to gener- 
ate the forward update, and a Jacobian term \d(g' , p! ', p')/d(g, p,p)\ = p L ~ 2 appears in the 
Hastings ratio. The calibration constraints fix certain taxa groupings as clades, and bound 
the age of the most recent common ancestor of certain clades of taxa. These constraints 
are implemented by rejecting proposed states that violate the constraints. 

Our MCMC convergence analysis, based on monitoring the asympt otic behavior of the 



autocorrelation for p, <r, and the log-prior and log-likelihood, follows iGeverl dl992). We 
made a number of checks on our implementation. We check that the computer function 
for the likelihood Equation ((3]) sums to one over data. We check that the marginal prior 
distribution of tji under /r with isochronous leaves is uniform. We recover the posterior 
distribution in Equation (fTU|) in the two leaf case. We fix a data set and vary the proportions 
in which update types are used. We check that statistics computed under the posterior do 
not vary, to within estimated errors. We recover the parameters of synthetic data, and the 
posterior distribution concentrates on the correct parameter values as the number, N, of 
traits displayed in the data increases. 



Data 



7. 1. Word lists 

In the Dven et al. ( 1997 ) and Ringe et al. ( 20021 ) data, a trait is a homology class of words. 
The setup is illustrated in Table Q] A set of K meaning categories are chosen and, for each 
of the L languag es in the stud y, words i n the K meaning categories are gathered. The 
Dven et al. ( 1997 ) data uses the I Swadeshi ( 1952 ) "word list" (in fact a list of meanings). In 
this list, K = 200 core meaning categories ("All", "And", "Animal",. . . ) are given. Words 
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in the Swadesh mea ning categories ar e relatively resistant to lateral trait transfer, referred 
to here as borrowing. lEmbleton ( 1986I ) observes that words borrowed from French and Latin 
mak e up about 60% of the English lexicon, but less than 6% of the Swadesh 200-word list. 
The Ringe etahl ( 2002h data we have uses a list of K = 328 meanings, (plus morphological 
traits, which we do not treat). There is a Swadesh list of K — 100 meaning categories 
thought to be particularly resistant to borrowing. The word lists are nested, so both data 

sets include the 200- word and 100- word lists. 

In the following, trait data collected by I Gray and Atkins on (2003) for Hittite , Tocharian 



A and Tocharian B are analysed with 84 languages (displayed in Fig. [3]) from the Dven et al 



(1997) data. These merged d ata are refer r ed to hereafter as the lDven et al.l (|l997 ) data. Of 
the L = 24 languages in the iRinge et~aH d2002l) data (d i splaye d in Fig. @]), 20 are ancient. 
In contrast, of the L = 87 la nguages in the iDven et al ] (|l997h data, just the three added 
bv lGrav and Atkinson! (|2003h are ancient. The two data sets are substantially independent. 
Both data sets are available in electronic format. 

The linguist identifies homology classes among the words in a given meaning category. 
In order to avoid false identification of homology, where there is merely a chance likeness 
of sound, linguists require close correspondence of meaning. Where words are judged to 
be descended from a common ancestor they are assigned the same trait label. This oper- 
ation, which requires expert knowledge, is equivalent to replacing words with trait labels, 
c G C, and thereby generating for each language i = 1, 2, . . . , L and each meaning category 
k = 1, 2, . . . , K a trait set h[ . In the context of this application, homology classes of traits 
are called cognate classes. Both data sets mark some cognate classes as equivocal, and offer 
"splitting" and "lumping" versions of the data. We present results for the "splitting" data 
which assigns separate labels to cognate classes which may in fact display a single homolo- 
gous trait. Results f or the lumping data are ve ry similar. We comment on this systematic 
error in Section WM jGrav and Atkinson! ( 20031 ) register the "splitting" Dven et al. ( 1997 ) 



and lRinge et alj (|2002r ) cognate data respectively as 87 x 2665 and 24 x 3174 binary matrices 
In the example in Table [2 the data is coded H\ = {1}, H\ = {1}, H\ = {2},. . . ,fff = 
{6}. Looking at Section^ we have an extra superscript (k) on trait-sets Hi marking the 
meaning class. In Section [5] we start with one independent copy of the trait birth-death 
process H(r,i) for each meaning category. 

The vocabularies of some ancient l anguages are only partially reconstructed, creating 
gaps in the binary sequence data. The IRinge et al.l (|2002l ) data marks these gaps. We are 
una ble to treat missin g data at this stage. We are obliged to drop from the analysis of 
the IRinge et al. ( 2002 ) data the languages Gothic, Lycian, Luvian, Oscan, Umbrian, Old 
Prussian, Old Persian, Avestan and Tocharian A, leaving the languages in Fig.|4j We retain 
some languages with small numbers of gaps, simply marking the gap as trait-absence. We 
discuss the associated model mis-specific ation bias in Section 19.71 The number of gaps in 
our registration of the IDven et al.l (|1997l ) data is negligible. 



Atkinson et al 



7.2. Calibration data 

Histor ical sources provide rate calibration data for these Indo-European data sets 
(2005) compile calibration points. For example, the Brythonic languages Welsh N, Welsh_C, 
Breton_List, Breton_SE and Breton_ST form a clade in the Dven et all ( 1997 ) data, with 
a common ancestor between 1450 and 1600 years before the present (BP, where the present 
is the year 2000 - only roughly the time the data was gathered, because the dating accu- 
racy is in any case low). In our analysis of the two data sets we imposed 16 groups of 
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taxa as clades: Brythonic, Celtic, Italic, Iberian-French, Germanic, West Germanic, North 
Germanic, Balto-Slav, Slav, Indie, Indo-Iranian, Iranian, Albanian, Greek, Armenian and 
Tocharic. 

The calibration points marked by horizontal bars in the sample tree states below give 
both lower and upper bounds on clade root times. Each such calibration point gives an 
independent estimate for A, fj, and 9. Prior knowledge providing only a lower bound on 
language branching ( "languages A and B were distinct by year C" ) is mo re common, but less 
valuab le, as it does not break the scale invariance discussed in Section[5J lYang and Rannala 
(2006) observe that, in a phylogenetic setting, there is often good evidence for the lower 
bound, but little confidence in the upper bound. The same applies here, since the upper 
bound for a split is supported by the absence of historical evidence for separate vocabularies. 
However, uncertainties in the positions of the upper bounds are of the order of hundreds of 
years, whilst the evolution rates we are calibrating give half lives of thousands of years, so 
we have not pursued this source of uncertainty. We do see one result which suggests that 
a soft upper limit would have improved the analysis. The one incorrect clade age estimate 
(Balto-Slav) we see in the cross-validation study (Supplement, Section 9.4) fell above the 
upper limit of the calibration interval. That upper limit is different from the others, since 
it is not based upon historical texts, but is instead derived from consideration of excavated 
cultural remains. The link between language and excavated culture is obviously weaker 
than language and literature. 



7.3. Previous studies 

The s urvey given in ISankofj (|l973h summarizes models of cognate trait data. 



I Sankoflj 

1973) presents relatively realistic models which are complex and parameter-rich. ISankofg 



1973J) discusses inference based on pairwise distances between the binary-trait data-vectors 
of two lang uages. This mode of inference has been the norm for lexical cognate class 
data. Thus iDven et al. (1992) use classical hierarchical clustering of data-vectors based 
on pairwise distances bet wee n languages to establish a tree o f languages. In contrast, 
I Gray and Atkinson ( 2003 ) use Ronquist and Huelsenbeckl (2003 ) MrBayes software and the 
Bayesian phyloge netic methods of Yang and Rannala (jl997 ). to fit the finite-sites DNA 
sequence model of Felsenstein ( 198ll ). The MrB ayes software allowed the m to account for 
the thinning of traits surviving into zero taxa. iPagel and Meadd {2006) describe and fit 
a related, more realistic, model of cognate replacement within meaning category. These 
models allow t r aits id entified in the data as homologous to arise by independent innovation. 
Warnow et al. ( 20061 ) propose a model in which each homology class has a unique birt h 



event. However, th ere is to date no stat i stical inference for the model. iRinge et al.l (|2002f ) 



lErdem et al] (|2006l ) and Nakhleh et al. (|2005l) reject dating, and avoid explicit modelling. 
They make a parsimony analysis without explicit measures of uncertainty. They allow 
some lateral transfer of traits, and thereby generalize to graphs which are not trees. They 
employ expert linguistic intervention in the inference, which becomes a well informed search 
through phylogenies. In light of the random and systematic error we measure below, we do 
not expe ct estimators of tre e topology related to the mode (ie parsimony) to be adequate. 
However IRinge et~aL ( 2002 ) add morphological traits. These may be more relia ble data 
than cognate traits. Such traits can be analysed in the framework we set out. Garrett! 
( 20061 ) shows that the breakup of dialect continua into languages is not tree-like. The local 
borrowing model we give in Section l9.ll generates similar model violations. 
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8. Inference 



8.1. Fitted Models 



When we fit the model of Section[2]to the lDven et al.l (|1997l ) and lRinge et al.l (|2002T ) data, we 
identify a model mis-specification problem. For meaning classes k = 1, 2, . . . , K denote by 
H^ k \r, i) a trait birth-death process modelling the evolution of words in meaning category 
k, so that for i = 1, 2, . . . , L, H\ k) = H^(t t ,i) is the data at the leaves under NOABSENT. 
Let A^) and jjS k ^ be the birth and death rates for traits in meaning class k. It is reasonable 
to expect any real language to have at least one word in each of the semantic fields in the 
Swadesh 200-word list at all times. It follows that the birth-death process must satisfy a 
no-empty-field condition, H^(r,i) ^ or N(r,i) > 0, for each (t, i) G [<?]. 

We ignore this no- empty- field condition in our analysis. We lump together the K copies 
of the birth-death process of traits corresponding to the different meaning classes. Under 
the empty-field approximation, and assuming the death rates [i = ^ k \ k = 1,2, . . .,K are 
all equal (see Section |£L3|) . the superposition 



A" 



H(T,i)=\jH^(T,i) 



(11) 



fe=l 



of birth-death processes generates another instance of the same process, with birth rate 
A = ~^2 k \( k > and death rate fi. If the mean number of words per meaning category is 
large, then the process does not visit the constraint, so the approximation holds. As we 
show in Section 19.51 this condition is not satisfied. However, when we study synthetic 
data simulated under the empty-empty-field condition, the calibration constraint forces the 
fitting procedure to adapt the approximating model to data by distorting estimates of fi, 
and thereby reproduce the uncalibrated clade root ages very well. 

We c arry out MCMC fr om p osterior distr i butio ns p(g, n\D) determined by Equation ((SJ) 
and the iDven et alj (|l997T l and iRinge et all (|2002t ) data, under t he NOUNIQUE ob serva- 
tion model. We repeated the analysis with NOABSENT for the IRinge et ahl (|2002h data, 
obtaining similar results. We apply the branching process prior Jq with prior 1/9, and the 
uniform root prior fn with T = 16000 (an uncontroversial upper limit on tn). The data 
overwhelm these two priors, differences between posterior estimates obtained under the two 
priors are slight, and we therefore discuss results for the prior /r in this paper and very 
briefly report /c-results in the supplement. Results are completely insensitive to the choice 
of T, for all T sufficiently large. 

In our search for conflicting signals in the data, we analyzed (in addition) subsets of 
the data. As discussed in Section |9~TI analyses of subsets of languages may be less exposed 
to error due to certain forms of borrowing. On the other hand we may uncov er rate het 
erogen eity between word lists or between groups of languages. We r educe the IDven et al 



(1997) data to the Swadesh 100- word list, and th e lRinge et al.l (|2002l ) data to the Swadesh 
200- and 100-word lists. We thin the Dven et al. ( 1997t ) data from L = 87 languages down 
to two sets containing L = 31 languages, and L = 30 languages (the two subsets are dis- 
played in the supplementary material), chosen in such a way that the pivotal calibrating 
dates remain applicable. These two data subsets overlap at 8 languages, but just one of 
the five calibration points has any common data (Tocharian, where there is no choice). We 
label analyses "Data-lst-author/Word List/Number of Leaves". 
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8.2. Results 

Figures [1] and [5] give a compact quantitative summary of the Dyen/200/87, Dyen/100/87, 
Dyen/200/31, Dyen/100/31, Dyen/200/30, Dyen/100/30 Ringe/328/15, Ringe/200/15 and 
Ringe/100/17 posterior distributions. The posterior probabilities for a selection of clades 
are displayed in Fig. [TJ and clade labels BGCI, GCI, CI, CG, Gl, GrA, notHT and notH 
defined. The posterior mean age for the common ancestor of the languages defining each 
corresponding clade is displayed in Fig. [51 This format is useful for identifying conflict 
between data subsets, once clades of interest have been identified. 




Fig. 1. Posterior probabilites for selected clades, across data sets, z-axis labels: BGCI, Balto- 
Slav-Germanic-Celtic-ltalic; GCI, Germanic-Celtic-ltalic; CI, Celtic-Italic; CG, Celtic-Germanic; Gl, 
Germanic-Italic; GrA, Greek-Armenian; notHT, complement of Hittite-Tocharian; notH, complement 
of Hittite. 



In the supplement, iNicholls and Gravl (|2007l ), we define and display consensus trees, a 



central point estimate for topology and branch length. The consensus tree is not a state in 
the sample space of trees. Prior constraints and leaf ages are represented on sampled states 
so we give, in Figures [3] and [H samples drawn from the Dyen/100/87 and Ringe/100/17 
posterior distributions. 

Referring to Fig. [1] there is some conflict in the support for clades across analyses. 
The BCIG group is strongly supported in all analyses (except Ringe/100/17, which does 
at least allow it). The CIG group is supported in all analyses (except Dyen/100/30, which 
allows it). However all the sub-clades CI, CG and IG are at odds with at least one data 
set. Age estimates for the common ancestor of the languages in the CIG clade are, in all 
analyses, close to the age estimates for the common ancestors of the subclades, suggesting 
the breakup occurred in a relatively small interval of time, so the split structure is poorly 
resolved. 

Referring again to the clade probabilities, Fig. [TJ and the consensus trees in the sup- 
plement, Hittite and Tocharian form an outgroup in the three Dyen/200/K analyses, are 
grouped with Greek and Armenian in the Dyen/100/y analyses, and are split in the three 
Ringe/X/F analyses. There are many model mispecification issues for Hittite and Tochar- 
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Fig. 2. The mean posterior ages, in years BP, for the most recent common ancestor (MRCA, super- 
cede root) of languages in selected combinations of clades (ie super-clades). x-axis labels as for 
Fig.ffl 



ian. Comparing notH and all in Fig.[2j Hittite adds 1000 years to the poster ior m ean root age 
in the Ringe/328/15 analysis. The contrast between the Dyen et al. ( 1997 ) and lRinge et al 



(2002) analyses is most clearly visible in the notH and notHT columns of Figures Q] and [2] 



In contrast, conflicts between analyses of the Swadesh 100-word list (Dyen/100/87, 
Dyen/100/31, Dyen/100/30 and Ringe/100/17, symbols + * DA) are almost absent. Both 
CI and IG (see Fig.Q]) are allowed by these analyses. Ringe/100/17 allows the clade HT and 
does not impose HT as an out group (which would be a conflict, as notHT is not a clade of 
Dyen/100/y). In other areas of conflict, the Dyen/100/F analyses allow a GrA clade. This 
lack of conflict comes at the price of greater random error (compared to analyses on longer 
word-lists). One striking conflict remains: the position of Indo-Iranian relative to the root 
is quite different in the Ringe/100/17, and Dyen/100/Y" analyses. 

The posterior mean ages for the notH, notHT, all and GrA, which show particular conflict 
in Fig. [2 are in fair agreement for analyses based on the Swadesh 100- word list. Comparing 
the Dyen/100/30 and Ringe/100/17 analyses, and looking at Fig.[2]and Supplement-Fig. 7, 
the clade root ages for notHT do not agree in either simulation. Otherwise there is agreement 
between the four analyses, on the ten measured ages, under one or both prior weightings. 
This reduced (K = 100) set of traits is chosen to be resistent to borrowing. Posterior pre- 
dictive replicates computed in Section 19.31 show little evidence of rate heterogeneity within 
this class of traits. The corresponding words are relatively well attested in otherwise incom- 
pletely reconstructed ancient languages, so there is little missing data. In Section 19.21 we 
compute posterior predictive distributions for singleton traits in the Ringe/100/17 analysis; 
these agree well with external data. 

In summary, the systematic errors displayed in our four age estimates from the Swadesh 
100-word list are representative. On the other hand, most features of tree topology which 
were in doubt, remain in doubt. 
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Fig. 3. Tree sampled from the Dyen/100/87 posterior distribution. x-ax\s gives age in years. Prior 
constraints on eight clade root and three leafages are indicated by horizontal bars. In order to reduce 
clutter, a single bar shows the two Tocharian A and B constraints, which are near equal. 
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Fig. 4. A sampled state illustrating the Ringe/100/17 posterior distribution. Prior constraints on 
topology impose 7 clades. Prior uncertainties in clade root and leaf ages are indicated by horizontal 
bars. 



9. Model mis-specification 

We head this section with a summary of its results. These results coincide with the conclu- 
sions we draw from the between-data analyses in Section [8. 21 our age estimates are robust; 
tree topology less so. In Fig. [5] and Fig.[5]we present results from synthetic data, simulated 
on a tree sampled from the posterior distribution of the Ringe/200/15 analysis (true clade 
st ructure is marked in Fig. [5] with below false clades and 1 below true; the true tree is 
in Nicholls and Gray (2007)), under a range of observation models intended to mimic likely 



model mis-specification. Details of these models, which simulate the empty-field-condition, 
plausible levels of borrowing and branch- wise and trait-wise rate heterogeneity, are given in 
Sections O through \£7\ 

Clades imposed in reconstructions from synthetic data are indexed S- and are the same 
as the clades defined below Fig. [1] and displayed as horizontal bars in Fig. |U Analyses of 
synthetic data are indexed "S/X/Y" . Values of X indicate borrowing and rate heterogeneity: 
X="T" is no borrowing; X="G6" is global borrowing at rate b/i; X="Lz — 6" is local 
borrowing, between languages with a common ancestor not more than z years in the past, 
at rate 6/x; and X="BHp" and X="MHp" have rates drawn independently, for each branch 
(BH) and meaning category (MH) respectively, from a Gamma distribution with mean fj, 
and standard deviation (p/100)^. Values of Y show the constraint applied: Y="Un" is the 
unconstrained birth death process of set elements, with n = X/ju, the expected number of 
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distinct traits at each leaf, under the NOABSENT observation model; Y="Cn" simulates 
cognate classes under the no-empty-field constraint, using n meaning categories. 



Clade support across synthetic data sets 
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Fig. 5. Synthetic data yields estimates of posterior probabilities for selected clades, across syn- 
thetic data sets, z-axis labels as for Fig. Q] with s- prefix indicating synthetic and 0/1 indicating 
absence/presence of the clade in the true tree. 



Clade ages across synthetic data sets 
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Fig. 6. Synthetic data yields estimates of mean posterior ages, in years BP, for the most recent 
common ancestor (MRCA, super-clade root) as in Fig.QJwith s- prefix indicating synthetic. 



Systematic errors in tree node ages inferred from synthetic data generated under these 
models are in general small. With the exception of S/BH50/U200 (see Section the 
systematic error we generated in Fig. [5] is of the same order of magnitude as, or smaller 
than, random error. Systematic error in estimated rates fi (not shown) is highly significant. 
Calibration data fixes dates and topology for that part of the tree adjacent to the leaves, 
forcing the inference to accommodate model mis-specification by adjusting rates. The mod- 
ified rates fit the imposed trait evolution adjacent to the leaves. If model mis-specification 
is homogeneous over the tree, as is the case for the empty-field-approximation, trait evolu- 
tion deep in the tree may be well represented by these biased rates, and date estimates are 
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accordingly robust. 

Tree topology is not robust to the model mis-specification we explored. The "true" 
s-clades S-GrA. S-BCIG, and the rooting clades S-notHT, and S-notH have robust support at 
levels which do not allow rejection of the truth. The "true" s-clades S-CIG and S-IG are not 
well reconstructed when borrowing is substantial. The branching at the top of the superclade 
S-BCIG is poorly resolved as the S-BCIG, S-CIG and S-IG branches are separated by just 
1000 y ears in the tree on which the synthetic data was simulated (see iNicholls and Gray 
(2007)), which is small compared to /i 1 ~ 3000. Nevertheless, the truth is rejected only at 
very high levels of borrowing (S/Gb/Y where b = 0.2, 0.5). Clade age estimates, shown in 
Fig. [2] and Fig. [6l can be stable across analyses when topology is uncertain. This is because 
the super-clade ages are determined largely by total tree length; total tree length is tightly 
coupled to the number of transitions on the tree, which is rather well determined by data. 



9. 1 . Global and local borrowing 

Word borrowing from languages outside the study is straightforward trait birth (unless the 
same word is borrowed into several languages). If we delete a source language from our 
study, we thereby remove the model error associated with borrowing from that language. 
The consistency we see in Fig.[2]between clade ages reconstructed for near-disjoint subsets of 
languages, and the full set, in the Dyen/200/87, Dyen/200/31 and Dyen/200/30 posteriors, 
suggests that borrowing is not distorting the Dyen/200/87 estimates themselves. 

Our models of borrowing are as follows. We associate with each time slice r G [0, oo) 
across the tree a linkage graph {£ (r), V(r)) with nodes, V(r) = {(r, i); (r, i) G [g]}, corre- 
sponding to points in [g] intersected by the slice. The linkage graph models traffic between 
languages; its edges (y,z) G £ (r) connect sets H (y) and H(z) between which trait instances 
can pass. Let V(z) = {z£ V(r) : 3y G V(t), ( y, z ) G £ (r)} be the set of nodes adjacent to 
z G V(r). Let b denote the relative rate of word-borrowing to word-death. At per capita 
rate bfi each instance of each trait in each language in the time slice r generates a borrowing 
event. Suppose the selected trait-instance is in language z G V(t) and is labeled c G C. A 
language y G V(z) is chosen, uniformly at random from nodes adjacent to z on (£ (r), V(r)), 
and we set H (y) <— H (y) U {c}, ie the word is copied into the target language. 

We model local borrowing as follows. Words transfer between languages which have a 
sufficiently recent common ancestor. The linkage graph at time t includes an edge from 
{t,i) to (t,j) if points (£, i) and (t,j) in [g] have a common ancestor less than z years in 
the past. In this model linked groups of languages break up into linked subgroups. In our 
model of widespread borrowing (the "global" borrowing model), all languages communicate 
equally with all other languages, and the linkage graph is the complete graph. 

Our exploration of these models is summarized in Figures[5]and[6]by the three S/G6/U200 
data sets and the S/L500 — 1/U200 data set. We display global borrowing at relative rates 
of 10%, 20% and 50% the death rate. Higher global rates are probably irrelevant. In the 
data, the distribution of card (M a ), the number of languages displaying cognate a, tails off 
rapidly, so that few cognates are displayed in many languages. At b ~ 1, cognates simply 
survive too well, and many cognates from deep in the tree survive into many languages. 
Local borrowing has time depth z = 500 and a borrowing rate equal to the death rate. We 
see from Fig. [6] that age estimates are robust to this form of model mis-specification. 
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9.2. Predictive distributions and external data 

Where the observation model is NOABSENT, singleton traits are present, and we can 
use them to test the model. We drop them from the data, carry out the inference under 
NOUNIQUE, and then see if w e can predict the n umber of singleton traits for each taxon. 
This check was available for the lRinge et al. ( 2002t ) data. We expect rate heterogeneity and 



borrowing to be visible (but probably not distinguishable) in these tests. 

Denote by D synthetic trait data generated under the NOABSENT observation model, 
displaying N distinct traits. For trait a = 1,2, ...,N let M a give the indices of leaves 
displaying an instance of trait a for predicted data D, and X$ be the number of singleton 
traits in D at taxon i, 

X t = card{M : M a = {i}, a = 1, 2, . . . ,N} ieV L . 

The posterior predictive distribution Pr{D\D} is 

Pr{D\D} = J Pr{D\g,n,\}p(g,ti,\\D)dgdiid\ 

and this determines a predictive distribution for Xi. We sample fi, A and g from the poste- 
rior p(g,n,X\D) {g and \x are available from MCMC output; we restore A by sampling its 
posterior conditional density), simulate synthetic data D at the leaves of g, and compute 
Xi from D. Let Xi(D) = card{Af Q : M a = {i},a = 1,2, . . . , N}, denote the number of 
singleton traits at taxon i in the original real data itself. 

Predictive distributions for Xi are giv en in Supplement-F ig. 9. The predictive distribu- 
tions for Xi over-estimate the Xi in the iRinge et al. ( 20021 ) data with K = 328 meaning 



categories. Since borrowing depletes singleton traits, this is consistent with model mis- 
specification due to borrowing. Also, we expect borrowing to be weaker on the shorter 
word-lists (K — 100,200), since the shorter lists are by design more resistent to borrow- 
ing. We see in Supplement-Fig. 9 that singleton traits are indeed more reliably predicted on 
shorter lists (especially K = 100). Rate heterogeneity can mimic this behavior. Correspond- 
ing studies for synthetic data are given in Supplement-Fig. 10,11. Predictive distributions 
from the shorter word-lists are in good agreement with the data. 



9.3. Rate heterogenei ty across traits 

Pagel and Meade (2006) show that the evolution rates of words are, for a given meaning cat- 



egory, fairly consistent across data sets, whilst varying more substantially between meaning 
categories. Fig. [7] displays a tendency for the shorter word-lists to evolve at relatively slower 
rates. This is expected. However the rate variation between data sets in Fig. [7J does not 
lead directly to variation in estimated root times in Fig. [2j For example, Dyen/200/87 and 
Dyen/100/87 differ by a factor 1.5 in posterior mean rate, but by just 1.2 in root age. Time 
depth measurements do not depend on an assumption of constant rates between analyses, 
since rates are estimated from calibration points in the recent history of the same data used 
to predict branching times. 

In order to generate synthetic data with rate heterogeneity across meaning classes (the 
S/MHp/U200 simulations), we draw rates 



^ fc) ~ Gamma(a,,3) 
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independently for each meaning category k — 1, 2, . . . , K, with mean a(3 = /i and variance 
a/3 2 = (r/i) 2 , where r = 0.25,0.5 and r = 1, simulate a trait process H^ k '{r,i) at rate 
A, /x( fc ) , merge meaning c ategories, as Equation (fTTj) . and then read off data, as Equation |J2]). 
IPagel and Meadd (|2006t ) estimate that the rate variance over meaning classes is (rfi) 2 ~ 



/<-' 

Rate heterogeneity across traits distorts the distribution of the number, cardM a , of 
languages in which trait a e C appears. Denote by Y^ n > = Y^ n \D) the number of traits 
displayed at n leaves, 

Y (n) {D) = card{Af a : cardM a = n, a = 1, 2, . . . , N}, 

and let Y^ n ' — Y^ n \D) be the corresponding random variable computed from posterior 
predictive data D ~ Pr{D\D}. We plot E(Y^\D) and the envelope ±2std(f(") \D). 
In the supplementary material (Supplement-Fig. 12) we show that the fitting procedure is 
unable to reproduce the trait frequency distribution in synthetic data with high levels of 
rate heterogeneity across meaning classes (standard deviation 50% of the mean) but lower 
levels (25%) are invisible. 

Returning to the real data, in Supplement-Fig. 13 (left), some inconsistency attributable 
to rate heterogeneity between traits is visible in our Ringe/328/15 analysis. Among other 
problems, the data contains an excess of traits appearing in 10 or more leaves. This is 
caused by a small cohort of traits evolving at death rate /i small compared to the rest. 
The effect is ver y greatly reduced in the Ringe/100/17 analysis (Supplement-Fig. 13, right). 



Analyses of the iDven et alJ (|1997l ) data show a similar pattern 



9.4. Rate heterogeneity in space and time 

Time depth measurements depend on some assumption about the way rates have changed 
over time within each data set we analyze. The variations in rates between clades within 
each of the D100, D200, R328 and R100 groups in Fig. [7] give us an indication of the un- 
modelled rate variation we can expect in the deeper branches of the tree. We give the 
per-trait-instance death rate [i for each clade calibration constraint independently. We 
sampled the posterior distributions p(g, ^|-D c iadc) determined by the data for each clade in 
turn. We could use the posterior rate distribution from one calibration clade as a prior to 
predict the age range for the root of another calibration clade. Where confidence intervals 
for the reconstructed rates of a given data set overlap, the corresponding predictions will be 
good. Such prediction is legitimate within one data set only, so we compare rates between 
vertical dashed lines. 

In the supplement for this section we report and discuss a similar cross-validation exercise 
on the Dyen/100/87 analysis. We drop each calibration-clade in turn (both topology and age 
constraint) and estimate the clade root age using all the word-list data and the remaining 
calibration points. Of 10 such tests, 8 succeed. The predicted age range for Balto-Slav is 
slightly too deep, and that of Hittite very significantly too young. 

Synthetic data with spatio-temporal rate variation (S/BHp/U200 analyses), have rates 

/i(i.j) ~ Gamma(a,/3) 

drawn independently on each edge G E, with mean a/3 = \i and variance a(3 2 — (r/i) 2 , 
where r = p/100, so that the standard deviation of the rates is 10%, 33% and 50% the mean 
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Fig. 7. Posterior mean values of ^, with standard errors at two standard deviations, measured in anal- 
yses Dyen/200/87 (D200-87), Dyen/100/87 (D1 00-87), Ringe/328/15 (R328-15), Ringe/200/15 
(R200-1 5) and Dyen/100/17 (R1 00-1 7), and independently using calibration points in distinct clades. 
The observation model is NOUNIQUE, except for two-leaf clades marked *, where NOABSENT must 
be used. 



rate. iLeed (|l953T ) sees 20% variation in rate estimates from pairs of languages. Results are 
robust to moderate levels of unstructured, random rate variation of this kind. 

Results are of course not robust to structured rate variation, in which, for example, 
rates on edges at ages greater than any calibration point are all larger than any rates in 
the calibration zone, or a single-taxon outgroup has an extreme rate. In S/ BH5Q/U200, 
s — hittite happens to have a high rate. Its root is pushed to great age, and with it goes 
the root of the tree. The analysis is exposed to rare "catastrophic" trait-evolution events 
outside the calibration zone. We checked that that no single language or small outgroup is 
determining the root age in the Ringe/100/17, and Dyen/100/y analy ses. Agreem e nt be- 
tween reconstructions ba sed on the p r edo minantly ancient languages of iRinge et all (j20021 ) 
and modern languages of iDven et al. (1997) shows that there is at least no such structured 
rate variation in the recent past. 



9.5. The empty-field approximation 

Our empty-field approximation will be good if there is significant "polymorphism" , that is, 
if the mean number X^/fJ,^ of traits (ie, words pe r meaning c a tegory ) in the if( fe )(r, z)- 
process is large. W e estimate A / u at 273(9) for the Dven et all (|l997 ) Swadesh-200 data 
and 280(25) for the IRinge et alj (|2002f) Swadesh-200 data (posterior standard deviation in 
parenthesis) and hence A^/m^ — 1-4- The probability, exp(— X^/fi^) ~ 1/4, to find 
the unconstrained trait-set process H^(r,i) in the empty set at any single fixed point 
(t, i) € [g] is high enough to cause concern. 

We simulate synthetic data from the trait birth-death process constrained to respect 
the no-empty-field condition. For each of the k = 1, 2, . . . , K meaning classes, we simulate 
N( k >(tn,R) from a Poisson distribution constrained to be greater than zero, then simulate 
H^ k \T,i)\N^ k \T,i) > in [g\. The total rate for the exponential waiting time to the 
next event does not include fj, if N^(r,i) = 1. We then merge the meaning classes as 
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in Equation (fTTj) . Our studies are represented here by two simulations, S/T/C200 and 
S/L500-1/C200, the latter including local borrowing. The per-capita death rate /i was set 
to a large value, so that polymorhpism was low. We find, when we fit data of this kind, 
that the tree and its dates are robust to this form of model mis-specification. 



9.6. Incorrect splitting deep rooted homology classes 

When the scientist groups instances of traits into homology classes, instances of traits born 
deep in the tree may be highly evolved, and correspondingly difficult to identify as in fact 
homologous. This error can populate the deeper branches of the tree with spurious birth 
events. This is a case where model mis-specification is not homogeneou s over the tr e e, and 



will lead to over-estimati on of the tree dept h. When we replace the iRinge et al.l (|2002f ) 



"splitting" data with the IRinge et all |2002) "lumping" data we do see a 3% downward 
shift in the estimated root time. 



9. 7. Unknown vocabulary as absent tra its 

In our analysis of the IRinge et al. ( 2002 ) data, we retain some languages with gaps, corre- 
sponding to missing data. We replace these gaps with zeros, marking trait absence. Gappy 
languages (Hittite, Tocharian) do stand out in predictive tests counting singleton traits on 
external data for the 328 and 200 word-lists. However, the effect is removed when we reduce 
the data to the Swadesh 100 word-list, where traits are better attested. The effect is to 



bias r econstructed branching times for gappy taxa to larger age values on the IRinge et al 
( 20021) 328 and 200 word-list data (see for example HT in Fig. HJ). 



Acknowledgements 

The authors acknowledge advice and assistance from Qucntin Atkinson and David Welch of 
the University of Auckland, and financial support from the Royal Society of New Zealand. 



References 

Atkinson, Q., G. Nicholls, D. Welch, and R. Gray (2005). From words to dates: water into 
wine, mathemagic or phylogenetic inference. Transactions of the Philological Society 103, 
193-219. 

Bergsland, K. and H. Vogt (1962). On the validity of glottochronology. Current Anthropol- 
ogy 3, 115-153. 

Blust, R. (2000). Why lexicostatistics doesn't work: the "universal constant" hypothesis 
and the austronesian languages. In C. Renfrew, A. McMahon, and L. Trask (Eds.), Time 
depth in historical linguistics, pp. 311-332. Cambridge, UK: The McDonald Institute for 
Archaeological Research. 

Drummond, A. J., G. K. Nicholls, A. G. Rodrigo, and W. Solomon (2002). Estimating 
mutation parameters, population history and genealogy simultaneously from temporally 
spaced sequnce data. Genetics 161, 1307-1320. 

Dyen, I., J. Kruskal, and B. Black (1992). An Indoeuropean classification: a lexicostatistical 
experiment. Transactions of the American Philosophical Society 82(1-132). 



Ancestral trees for binary trait data 23 

Dyen, I., J. Kruskal, and P. Black (1997). Electronic format from website of 3rd author, 
http : //www. cdu. edu. au/research/prof iles/prof ile_black.html 

Emblcton, S. (1986). Statistics in Historical Linguistics. Bochum: Brockmeyer. 

Erdem, E., V. Lifschitz, and D. Ringe (2006). Temporal phylogenetic networks and logic 
programming. Theory and Practice of Logic Programming 6, 539-558. 

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood 
approach. J. Mot Evol. 17, 368-376. 

Felsenstein, J. (1992). Phylogenies from restriction sites: a maximum-likelihood approach. 
Evolution 46, 159-173. 

Garrett, A. (2006). Convergence in the formation of Indo-European subgroups: phylogeny 
and chronology. In P. Forster and C. Renfrew (Eds.), Phylogenetic Methods and the 
Prehistory of Languages, pp. 139-152. Cambride, UK: The McDonald Institute for Ar- 
chaeological Research. 

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statist. Sci. 7, 
473-511. 

Gray, R. and Q. Atkinson (2003). Language-tree divergence times support the Anatolian 
theory of Indo-European origin. Nature 426, 435-439. 

Huson, D. and M. Steel (2004). Phylogenetic trees based on gene content. Bioinformat- 
ics 20(13), 2044-2049. 

Kimura, M. and J. Crow (1964). The number of alleles that can be maintained in a finite 
population. Genetics 49, 725-738. 

Lees, R. (1953). On the basis of glottochronology. Language 29, 113-127. 

Lewis, P. (2001). A likelihood approach to estimating phylogeny from discrete morphological 
character data. Systematic Biology 50, 913-925. 

Nakhlch, L., D. Ringe, and T. Warnow (2005). Perfect phylogenetic networks: A new 
methodology for reconstructing the evolutionary history of natural languages. LAN- 
GUAGE, Journal of the Linguistic Society of America 81, 382-420. 

Nicholls, G. and R. Gray (2007). Dated ancestral trees from binary trait data 
and its application to the diversification of languages (supplementary material), 
www . stats . ox . ac . uk/~nicholls/linkf iles/papers/NichollsGray06-SUPP .pdf 

Nylander, J., F. Ronquist, J. Huelsenbeck, and J. Nieves-Aldrey (2004). Bayesian phyloge- 
netic analysis of combined data. Systematic Biology, 47-67. 

Pagel, M. and A. Meade (2006). Estimating rates of lexical replacement on phylogenetic 
trees of languages. In P. Forster and C. Renfrew (Eds.), Phylogenetic Methods and the 
Prehistory of Languages, pp. 173-181. Cambride, UK: The McDonald Institute for Ar- 
chaeological Research. 



24 



Ringc, D., T. Warnow, and A. Taylor (2002). Indo-European and Computational Cladis- 
tics. Transactions of the Philological Society 100, 59-129. Data available from 
www . cs . rice . edu/~nakhleh/CPHL/\#datasets 

Ronquist, F. and J. P. Huelsenbeck (2003). Mrbayes 3: Bayesian phylogenetic inference 
under mixed models. Bioinformatics 19, 1572-1574. 

Saitou, N. and M. Nei (1987). The neighbor-joining method: a new method for reconstruct- 
ing phylogenetic trees. Molecular Biology and Evolution 4, 406-425. 

Sankoff, D. (1973). Mathematical developments in lexicostatistic theory. Current Trends 
in Linguistics 11, 93-113. 

Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts. Proc. Am. Phil. 
Soc. 96, 453-463. 

Warnow, T., S. Evans, D. Ringe, and L. Nakhleh (2006). A stochastic model of language 
evolution that incorporates homoplasy and borrowing. In P. Forster and C. Renfrew 
(Eds.), Phylogenetic Methods and the Prehistory of Languages, pp. 75-87. Cambride, 
UK: The McDonald Institute for Archaeological Research. 

Watterson, G. (1975). On the number of segregating sites in genetical models without 
recombination. Theoretical Population Biology 7, 256-276. 

Yang, Z. and B. Rannala (1997). Bayesian phylogenetic inference using DNA sequences: a 
Markov Chain Monte Carlo method. Molecular Biology and Evolution 14, 717-724. 

Yang, Z. and B. Rannala (2006). Bayesian estimation of species divergence times under 
a molecular clock using multiple fossil calibrations with soft bounds. Molecular Biology 
and Evolution 23, 212-226. 



